Clusters of Countries
Clusters of Countries
Clusters of Countries
Data Used
The data used is Population Growth, Fertility and Mortality Indicators.csv, tells about the number of some variables related to population fertility and mortality of each country around the world.
We have some variables from the data, and they are :
T03The country codePopulation.growth.and.indicators.of.fertility.and.mortalityThe country listXThe year columnX.1Variable which contains some indicators, this variable is going to be spread to some variables.X.2The values of the observations.X.3FootnotesX.4Data source
The Goal
Assume that we are going to classify countries listed based on the indicators contained in the data.
The Flow
Libraries Importing and Data Preparation.
Exploratory Data Analyst
PCA Transformation.
Biplotting and Interpretation.
Libraries Importing and Data Preparation
Libraries Used
library(tidyverse) # Data manipulating
library(FactoMineR) # performing PCA
library(factoextra) # Creating some plot (biplot)
library(plotly) # Interactive plotting
library(tibble) # Creating rownames
library(corrplot) # Creating correlation plot
library(countrycode) #generating continent variable from country code
library(gganimate) #animating plotData Importing
mortal_total <- read.csv("source/Population Growth, Fertility and Mortality Indicators.csv")
str(mortal_total)## 'data.frame': 4979 obs. of 7 variables:
## $ T03 : Factor w/ 264 levels "1","100","104",..: 264 1 1 1 1 1 1 1 1 1 ...
## $ Population.growth.and.indicators.of.fertility.and.mortality: Factor w/ 264 levels "","Afghanistan",..: 1 238 238 238 238 238 238 238 238 238 ...
## $ X : Factor w/ 18 levels "2000","2001",..: 18 6 6 6 6 6 6 6 11 11 ...
## $ X.1 : Factor w/ 8 levels "Infant mortality for both sexes (per 1,000 live births)",..: 7 6 8 1 5 2 4 3 6 8 ...
## $ X.2 : Factor w/ 1031 levels "-0.1","-0.2",..: 1031 43 191 480 289 725 701 752 42 191 ...
## $ X.3 : Factor w/ 45 levels "","Break in the time series.",..: 44 11 11 11 1 11 11 11 11 11 ...
## $ X.4 : Factor w/ 5 levels "Source","United Nations Population Division, New York, World Population Prospects: The 2017 Revision, last accessed June 2017.",..: 1 2 3 4 5 3 3 3 2 3 ...
We only need some variables to process the data, the last 2 columns and the first column will be eliminated
There is a year column (from 2000 to 2016 ), most of the countries only have values for 2005, 2010, and 2015.
The
X.1contains 8 indicators, we’re going to spread them into their own column
Data cleaning
- In the chunk below we’re going to remove the last 2 variables and filter the year, we only need the 2015 data to interpret the latest condition of each country.
mortal_lastyears <- mortal_total %>%
select(-c(X.3,X.4)) %>%
filter(X == 2015) %>%
spread(key = X.1, value = X.2) %>% # create column for each indicator in X.1
rename(year = X,
Code = T03,
Country = Population.growth.and.indicators.of.fertility.and.mortality,
inf.mort = `Infant mortality for both sexes (per 1,000 live births)`,
life.exp.both = `Life expectancy at birth for both sexes (years)`,
life.exp.male = `Life expectancy at birth for males (years)`,
life.exp.female = `Life expectancy at birth for females (years)`,
maternal.mortality.ratio = `Maternal mortality ratio (deaths per 100,000 population)`,
pop.increase = `Population annual rate of increase (percent)`,
tot.fertil.rate = `Total fertility rate (children per women)`) %>%
mutate(inf.mort = as.numeric(as.character(inf.mort)),
life.exp.both = as.numeric(as.character(life.exp.both)),
life.exp.female = as.numeric(as.character(life.exp.female)),
life.exp.male = as.numeric(as.character(life.exp.male)),
maternal.mortality.ratio = as.numeric(as.character(maternal.mortality.ratio)),
pop.increase = as.numeric(as.character(pop.increase)),
tot.fertil.rate = as.numeric(as.character(tot.fertil.rate)))## Warning: NAs introduced by coercion
head(mortal_lastyears)## Code Country year inf.mort life.exp.both
## 1 1 Total, all countries or areas 2015 35.0 70.8
## 2 100 Bulgaria 2015 8.3 74.3
## 3 104 Myanmar 2015 45.0 66.0
## 4 108 Burundi 2015 77.9 56.1
## 5 11 Western Africa 2015 70.5 54.7
## 6 112 Belarus 2015 3.6 72.1
## life.exp.female life.exp.male maternal.mortality.ratio pop.increase
## 1 73.1 68.6 216 1.2
## 2 77.8 70.8 11 -0.6
## 3 68.3 63.7 178 0.9
## 4 58.0 54.2 712 3.0
## 5 55.6 53.9 NA 2.7
## 6 77.7 66.5 4 0.0
## tot.fertil.rate
## 1 2.5
## 2 1.5
## 3 2.3
## 4 6.0
## 5 5.5
## 6 1.6
Country = Country list ; inf.mort = Infant mortality for both sexes (per 1,000 live births) ; life.exp.both = Life expectancy at birth for both sexes (years) ; life.exp.male = Life expectancy at birth for males (years) ; life.exp.female = Life expectancy at birth for females (years) ; maternal.mortality.ratio = Maternal mortality ratio (deaths per 100,000 population) ; pop.increase = Population annual rate of increase (percent) ; tot.fertil.rate = Total fertility rate (children per women)
NA checking
mortal_lastyears %>%
is.na() %>%
colSums()## Code Country year
## 0 0 0
## inf.mort life.exp.both life.exp.female
## 31 31 29
## life.exp.male maternal.mortality.ratio pop.increase
## 29 73 0
## tot.fertil.rate
## 29
There are so many NAs in the data, it means that not all country listed have the data we need.
- We’re going to replace the NAs to the average value of each variable/indicator.
# Assigning the average value of each variables
life.exp.both.avg <- mean(mortal_lastyears$life.exp.both,na.rm = T)
life.exp.male.avg <- mean(mortal_lastyears$life.exp.male, na.rm = T)
life.exp.female.avg <- mean(mortal_lastyears$life.exp.female,na.rm = T)
inf.mort.avg <- mean(mortal_lastyears$inf.mort,na.rm = T)
maternal.mortality.ratio.avg <- mean(mortal_lastyears$maternal.mortality.ratio,na.rm = T)
pop.increase.avg <- mean(mortal_lastyears$pop.increase,na.rm = T)
tot.fertil.rate.avg <- mean(mortal_lastyears$tot.fertil.rate,na.rm = T)
# Replacing the NAs with the avg value
mortal_lastyears <- mortal_lastyears %>%
mutate(inf.mort = replace_na(inf.mort,inf.mort.avg),
life.exp.both = replace_na(life.exp.both,life.exp.both.avg),
life.exp.female = replace_na(life.exp.female,life.exp.female.avg),
life.exp.male = replace_na(life.exp.male,life.exp.male.avg),
maternal.mortality.ratio = replace_na(maternal.mortality.ratio,maternal.mortality.ratio.avg),
pop.increase = replace_na(pop.increase,pop.increase.avg),
tot.fertil.rate = replace_na(tot.fertil.rate,tot.fertil.rate.avg))
head(mortal_lastyears)## Code Country year inf.mort life.exp.both
## 1 1 Total, all countries or areas 2015 35.0 70.8
## 2 100 Bulgaria 2015 8.3 74.3
## 3 104 Myanmar 2015 45.0 66.0
## 4 108 Burundi 2015 77.9 56.1
## 5 11 Western Africa 2015 70.5 54.7
## 6 112 Belarus 2015 3.6 72.1
## life.exp.female life.exp.male maternal.mortality.ratio pop.increase
## 1 73.1 68.6 216.0000 1.2
## 2 77.8 70.8 11.0000 -0.6
## 3 68.3 63.7 178.0000 0.9
## 4 58.0 54.2 712.0000 3.0
## 5 55.6 53.9 162.1842 2.7
## 6 77.7 66.5 4.0000 0.0
## tot.fertil.rate
## 1 2.5
## 2 1.5
## 3 2.3
## 4 6.0
## 5 5.5
## 6 1.6
There is an odd thing on the data as we replace the NA with the average number of each column. There are some rows/countries which have no observation value or only have 1 or 2 value for their indicator and we have filled them with the average values and it’s not supposed to be like that. We supposed to eliminate them.
- eliminating some rows
# we will create a vector that indicates wether a rows' values are mostly the avg values of each column or not
cb <-
mortal_lastyears$inf.mort == inf.mort.avg &
mortal_lastyears$life.exp.both == life.exp.both.avg &
mortal_lastyears$life.exp.male == life.exp.male.avg &
mortal_lastyears$life.exp.female == life.exp.female.avg &
mortal_lastyears$maternal.mortality.ratio == maternal.mortality.ratio.avg &
mortal_lastyears$tot.fertil.rate == tot.fertil.rate.avg
# join the vector to the data
mortal_lastyears <- cbind(mortal_lastyears, cb)
# eliminating the rows have mostly the average values in its columns
mortal_lastyears <- mortal_lastyears %>%
filter(cb == F)
mortal_lastyears <- mortal_lastyears[,-c(1,3,11)]
str(mortal_lastyears)## 'data.frame': 235 obs. of 8 variables:
## $ Country : Factor w/ 264 levels "","Afghanistan",..: 238 36 157 38 258 23 40 5 41 42 ...
## $ inf.mort : num 35 8.3 45 77.9 70.5 3.6 29.9 27.7 67.5 4.7 ...
## $ life.exp.both : num 70.8 74.3 66 56.1 54.7 72.1 67.6 75.3 56.4 81.8 ...
## $ life.exp.female : num 73.1 77.8 68.3 58 55.6 77.7 69.6 76.5 57.7 83.8 ...
## $ life.exp.male : num 68.6 70.8 63.7 54.2 53.9 66.5 65.5 74.1 55.1 79.7 ...
## $ maternal.mortality.ratio: num 216 11 178 712 162 ...
## $ pop.increase : num 1.2 -0.6 0.9 3 2.7 0 1.6 2 2.7 1 ...
## $ tot.fertil.rate : num 2.5 1.5 2.3 6 5.5 1.6 2.7 3 5 1.6 ...
Continent Column
I think by giving the Continent column, we’re going to have some more insights, so let’s just do it.
mortal_lastyears$Continent <- countrycode(sourcevar = mortal_lastyears[,"Country"],
origin = "country.name",
destination = "continent")## Warning in countrycode(sourcevar = mortal_lastyears[, "Country"], origin = "country.name", : Some values were not matched unambiguously: Africa, Asia, Australia and New Zealand, Caribbean, Central America, Central Asia, Channel Islands, Eastern Africa, Eastern Asia, Eastern Europe, Europe, Latin America & the Caribbean, Melanesia, Micronesia, Middle Africa, Northern Africa, Northern America, Northern Europe, Oceania, Other non-specified areas, Polynesia, South America, South-central Asia, South-eastern Asia, Southern Africa, Southern Asia, Southern Europe, Sub-Saharan Africa, Total, all countries or areas, Western Africa, Western Asia, Western Europe
## Warning in countrycode(sourcevar = mortal_lastyears[, "Country"], origin = "country.name", : Some strings were matched more than once, and therefore set to <NA> in the result: Australia and New Zealand,Oceania,Oceania
so some rows cannot be defined by its continent and all of them are not even a country actually. They are just regions or certain areas of the continent.
Our observations are countries so we wil just eliminate rows that represent some areas or regions.
mortal_lastyears <- mortal_lastyears %>%
mutate(Continent = replace_na(Continent, "?"))
mortal_lastyears <- mortal_lastyears %>%
filter(Continent != "?") %>%
mutate(Continent = as.factor(Continent))- We better assign the
Countryas rownames instead.
mortal_lastyears <- column_to_rownames(mortal_lastyears,"Country")Now the data is ready to be proceed.
Exploratory Data Analyst
- Variables Correlation
corrplot(cor(mortal_lastyears[,-8]), method = "circle") From the plot above we can conclude that :
- the correlation between life expectancy of birth of male, female, and both are really high. In this case we better use the
life expectancy of both
mortal_lastyears <- mortal_lastyears[,-c(3,4)]all variables have relatively strong correlation to each other but
pop.increasethe
pop.increasehas the least correlation with other variables
Life Expectantion of the World
g <- ggplot(mortal_lastyears, aes(life.exp.both, inf.mort, text = rownames(mortal_lastyears))) +
geom_jitter(aes(col = Continent)) +
geom_smooth(aes(group = 1), method = loess)+
theme_classic()
ggplotly(g)g6 <- ggplot(mortal_lastyears, aes(life.exp.both, pop.increase, text = rownames(mortal_lastyears))) +
geom_jitter(aes(col = Continent)) +
geom_smooth(aes(group = 1), method = loess)+
theme_classic()
ggplotly(g6)g7 <- ggplot(mortal_lastyears, aes(life.exp.both, maternal.mortality.ratio.avg, text = rownames(mortal_lastyears))) +
geom_jitter(aes(col = Continent)) +
geom_smooth(aes(group = 1), method = loess)+
theme_classic()
ggplotly(g7)g8 <- ggplot(mortal_lastyears, aes(life.exp.both, tot.fertil.rate, text = rownames(mortal_lastyears))) +
geom_jitter(aes(col = Continent)) +
geom_smooth(aes(group = 1), method = loess)+
theme_classic()
ggplotly(g8)Africa dominates the low life expectantion area but Europe are mostly on the high area of life expectancy . The rest are spread from the middle to the high.
Usualy the countries which infant mortality is high have less life expectantion. The infants die and the life expectantion is lower than other countries, Africa dominates this area and Europe is on the other side.
The higher fertility rate the lower life expectancy,Africa dominates this area and Europe is on the other side.
Total Fertility of the World
g1 <- ggplot(mortal_lastyears, aes(tot.fertil.rate,life.exp.both, text = rownames(mortal_lastyears))) +
geom_jitter(aes(col = Continent)) +
geom_smooth(method = loess, aes(group = 1))+
theme_classic()
ggplotly(g1)g2 <- ggplot(mortal_lastyears, aes(tot.fertil.rate,inf.mort, text = rownames(mortal_lastyears))) +
geom_jitter(aes(col = Continent)) +
geom_smooth(method = loess, aes(group = 1))+
theme_classic()
ggplotly(g2)g3 <- ggplot(mortal_lastyears, aes(tot.fertil.rate,maternal.mortality.ratio, text = rownames(mortal_lastyears))) +
geom_jitter(aes(col = Continent)) +
geom_smooth(method = loess, aes(group = 1))+
theme_classic()
ggplotly(g3)it’s kinda make sense countries with low fertility rate have low infant mortality number.
Usualy the countries which total fertility is high have low life expectancy.
Countries with high fertility rate tend to have high maternal mortality ratio and this still dominated by African countries.
Population Increase of The world
g4 <- ggplot(mortal_lastyears, aes(pop.increase,inf.mort, text = rownames(mortal_lastyears))) +
geom_jitter(aes(col = Continent)) +
geom_smooth(method = loess, aes(group = 1))+
theme_classic()
ggplotly(g4)g5 <- ggplot(mortal_lastyears, aes(pop.increase,tot.fertil.rate, text = rownames(mortal_lastyears))) +
geom_jitter(aes(col = Continent)) +
geom_smooth(method = loess, aes(group = 1))+
theme_classic()
ggplotly(g5)Most African countries and some Asian country have high pop increase and high infant mortality, it’s not really good though, it seems like they produce babies as much as possible but can’t really keep them alive until adult.
Some Asian countries even keep their infant mortality low but still their population increase greatly. And they are the “oil well” of the world.
The higher total fertility rate, the higher population increase.
Data Clustering
Data Scalling
mortal_lastyears_scaled <- scale(mortal_lastyears[,-6])
summary(mortal_lastyears_scaled)## inf.mort life.exp.both maternal.mortality.ratio
## Min. :-1.0369 Min. :-2.6949 Min. :-0.7804
## 1st Qu.:-0.8089 1st Qu.:-0.6741 1st Qu.:-0.7142
## Median :-0.3700 Median : 0.2131 Median :-0.4225
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.7055 3rd Qu.: 0.6937 3rd Qu.: 0.1243
## Max. : 2.9556 Max. : 1.4947 Max. : 3.5298
## pop.increase tot.fertil.rate
## Min. :-2.7139 Min. :-1.1726
## 1st Qu.:-0.7343 1st Qu.:-0.7470
## Median :-0.0744 Median :-0.3215
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6588 3rd Qu.: 0.6359
## Max. : 3.8116 Max. : 3.2245
Now the data has been scaled, we’re ready for clustering.
Optimal K value
wss <- function(data, maxCluster = 9) {
# Initialize within sum of squares
SSw <- (nrow(data) - 1) * sum(apply(data, 2, var))
for (i in 2:maxCluster) {
set.seed(10)
SSw[i] <- sum(kmeans(data, centers = i)$withinss)
}
plot(1:maxCluster, SSw, type = "o", xlab = "Number of Clusters", ylab = "Within groups sum of squares", pch=19)
}
wss(mortal_lastyears_scaled)The elbow method shows that the optimum K value is 2. But i think we should try 3 as well since 2 clusters will not give us much information.
K-means
- k-means modelling
set.seed(11)
mortal_cluster2 <- kmeans(mortal_lastyears_scaled,2)
mortal_cluster3 <- kmeans(mortal_lastyears_scaled,3)- cluster distribution
table(mortal_cluster2$cluster)##
## 1 2
## 153 50
table(mortal_cluster3$cluster)##
## 1 2 3
## 42 104 57
- assigning cluster to new columns
mortal_lastyears$clust2 <- factor(mortal_cluster2$cluster)
mortal_lastyears$clust3 <- factor(mortal_cluster3$cluster)Biplotting
Performing PCA on the Data
mortal_pca <- PCA(mortal_lastyears_scaled, graph = F)
mortal_pca_pr <- prcomp(mortal_lastyears_scaled)Information gathered in each dimension
mortal_pca$eig## eigenvalue percentage of variance cumulative percentage of variance
## comp 1 3.9451216 78.902432 78.90243
## comp 2 0.6727678 13.455355 92.35779
## comp 3 0.1954976 3.909951 96.26774
## comp 4 0.1354516 2.709031 98.97677
## comp 5 0.0511615 1.023230 100.00000
fviz_screeplot(mortal_pca)The dimension 1 contains 80% of information and dimension 2 contains 12% information. The total is arround 92% of information.
Variables Contribution
fviz_contrib(mortal_pca_pr, choice = "var", axes = 1)fviz_contrib(mortal_pca_pr, choice = "var", axes = 2)fviz_pca_var(mortal_pca_pr,
col.var = "contrib",
gradient.cols = c("pink", "red", "black") ,
repel = T)Cluster Plot
2 Clusters
fviz_cluster(mortal_cluster2,
data = mortal_lastyears_scaled,
labelsize = 5,
repel = T)pr <- prcomp(mortal_lastyears_scaled)
mortal_arrow <- data.frame(pr$rotation)
mortal_pr_c2 <- data.frame(pr$x,
clust = factor(mortal_cluster2$cluster),
continent = mortal_lastyears$Continent,
country = factor(rownames(mortal_lastyears)),
inf.mort = mortal_lastyears$inf.mort,
life.exp.both = mortal_lastyears$life.exp.both,
maternal.mort.ratio = mortal_lastyears$maternal.mortality.ratio,
pop.increase = mortal_lastyears$pop.increase,
tot.fertil.rate = mortal_lastyears$tot.fertil.rate)
mortal_pr_c2 <- mortal_pr_c2 %>%
mutate(text = paste('clust :', factor(mortal_cluster2$cluster),
'
continent :' ,mortal_lastyears$Continent,
'
country :' ,factor(rownames(mortal_lastyears)),
'
inf.mort :' ,mortal_lastyears$inf.mort,
'
life.exp.both :' , mortal_lastyears$life.exp.both,
'
maternal.mort.ratio :', mortal_lastyears$maternal.mortality.ratio,
'
pop.increase :', mortal_lastyears$pop.increase,
'
tot.fertil.rate :', mortal_lastyears$tot.fertil.rate))
po <- ggplot(mortal_pr_c2, aes(PC1,PC2, text = text ))+
geom_hline(aes(yintercept=0), size=.2, alpha = 0.5, linetype = 2) +
geom_vline(aes(xintercept=0), size=.2, alpha = 0.5, linetype = 2)+
geom_point(aes(col = clust, pch = continent)) +
theme_classic()
ggplotly(po, tooltip = 'text')the countries which have low life expectancy for male and female
the countries which have high fertility rate
the countries which have high population increase
and African countries dominate this cluster.
This cluster indicates the countries contained maybe are not a healthy country since they have low life expectancy. This countries will have more young people in the future since the are high fertility rate and the population grows rapidly.
cluster 2 is :
the countries which have high life expectancy for male and female
the countries which have low fertility rate
the countries which have low infant mortality number
the countries which have low maternal mortality ratio
and all Europe countries are in cluster2.
This cluster indicates the countries contained will tend to have less productive people in the future since the fertility rate is not really good and the population is not growing well. In this case, high life expectancy will make this countries population dominated by old people one day.
3 Clusters
fviz_cluster(mortal_cluster3,
data = mortal_lastyears_scaled,
labelsize = 5,
repel = T)mortal_pr_c3 <- data.frame(pr$x,
clust = factor(mortal_cluster3$cluster),
continent = mortal_lastyears$Continent,
country = factor(rownames(mortal_lastyears)),
inf.mort = mortal_lastyears$inf.mort,
life.exp.both = mortal_lastyears$life.exp.both,
maternal.mort.ratio = mortal_lastyears$maternal.mortality.ratio,
pop.increase = mortal_lastyears$pop.increase,
tot.fertil.rate = mortal_lastyears$tot.fertil.rate)
mortal_pr_c3 <- mortal_pr_c3 %>%
mutate(text = paste('clust :', factor(mortal_cluster3$cluster),
'
continent :' ,mortal_lastyears$Continent,
'
country :' ,factor(rownames(mortal_lastyears)),
'
inf.mort :' ,mortal_lastyears$inf.mort,
'
life.exp.both :' , mortal_lastyears$life.exp.both,
'
maternal.mort.ratio :', mortal_lastyears$maternal.mortality.ratio,
'
pop.increase :', mortal_lastyears$pop.increase,
'
tot.fertil.rate :', mortal_lastyears$tot.fertil.rate))
po <- ggplot(mortal_pr_c3, aes(PC1,PC2, text = text ))+
geom_hline(aes(yintercept=0), size=.2, alpha = 0.5, linetype = 2) +
geom_vline(aes(xintercept=0), size=.2, alpha = 0.5, linetype = 2)+
geom_point(aes(col = clust, pch = continent)) +
theme_classic()
ggplotly(po, tooltip = 'text')When we divide the data into 3 clusters, we can conclude that the cluster 1 is :
the countries which have low life expectancy for male and female
the countries which have high fertility rate
the countries which have high population increase
African countries still dominate this cluster
This cluster is not really different with the cluster 1 from the case before.
cluster 2 is :
the countries in the middle, their observation values are near the average.
there are some outliers in this cluster. they are countries with high population growing and low infant mortality, the “oil well” i’ve told you before.
cluster 3 is :
the countries which have high life expectancy for male and female
the countries which have low fertility rate
the countries which have low population increase
the countries which have low maternal mortality ratio
This cluster indicates the countries contained will more likely to have less young people than countries in other clusters. the have low pop. increase, fertility rate. These countries should be more “productive”.
Animated Plot
So we’re going to see the animated plot of each country of each cluster from 2005 to 2015. We expect to see some countries change their cluster from time to time.
Data Tidying
The data tidying is not really different from the data preparation above, but we have the year column this time.
mortal_allyears <- mortal_total %>%
select(-c(X.3,X.4)) %>%
filter(Population.growth.and.indicators.of.fertility.and.mortality %in% rownames(mortal_lastyears)) %>%
spread(key = X.1, value = X.2) %>% # create column for each indicator in X.1
rename(year = X,
Code = T03,
Country = Population.growth.and.indicators.of.fertility.and.mortality,
inf.mort = `Infant mortality for both sexes (per 1,000 live births)`,
life.exp.both = `Life expectancy at birth for both sexes (years)`,
maternal.mortality.ratio = `Maternal mortality ratio (deaths per 100,000 population)`,
pop.increase = `Population annual rate of increase (percent)`,
tot.fertil.rate = `Total fertility rate (children per women)`) %>%
mutate(inf.mort = as.numeric(as.character(inf.mort)),
life.exp.both = as.numeric(as.character(life.exp.both)),
maternal.mortality.ratio = as.numeric(as.character(maternal.mortality.ratio)),
pop.increase = as.numeric(as.character(pop.increase)),
tot.fertil.rate = as.numeric(as.character(tot.fertil.rate)))## Warning: NAs introduced by coercion
str(mortal_allyears)## 'data.frame': 615 obs. of 10 variables:
## $ Code : Factor w/ 264 levels "1","100","104",..: 2 2 2 3 3 3 4 4 4 6 ...
## $ Country : Factor w/ 264 levels "","Afghanistan",..: 36 36 36 157 157 157 38 38 38 23 ...
## $ year : Factor w/ 18 levels "2000","2001",..: 6 11 16 6 11 16 6 11 16 6 ...
## $ inf.mort : num 12.7 9.5 8.3 57.9 52.2 45 94.6 86.2 77.9 9.6 ...
## $ life.exp.both : num 72.2 73.1 74.3 62.9 64.3 66 52 53.7 56.1 67.8 ...
## $ Life expectancy at birth for females (years): Factor w/ 1031 levels "-0.1","-0.2",..: 845 857 870 701 715 739 549 569 602 818 ...
## $ Life expectancy at birth for males (years) : Factor w/ 1031 levels "-0.1","-0.2",..: 744 754 778 646 664 682 504 530 555 665 ...
## $ maternal.mortality.ratio : num 15 11 11 248 205 178 863 808 712 13 ...
## $ pop.increase : num -0.8 -0.7 -0.6 1 0.7 0.9 3 3.3 3 -0.6 ...
## $ tot.fertil.rate : num 1.2 1.5 1.5 2.9 2.6 2.3 6.9 6.5 6 1.3 ...
# adding continent
mortal_allyears$Continent <- countrycode(sourcevar = mortal_allyears[,"Country"],
origin = "country.name",
destination = "continent")
mortal_allyears <- mortal_allyears %>%
filter(Continent != "?") %>%
select( -c("Code","Life expectancy at birth for females (years)", "Life expectancy at birth for males (years)")) %>%
mutate(Continent = as.factor(Continent))
mortal_allyears %>%
is.na() %>%
colSums()## Country year inf.mort
## 0 0 17
## life.exp.both maternal.mortality.ratio pop.increase
## 16 74 5
## tot.fertil.rate Continent
## 5 0
# assigning a rownames from Country var.
mortal_lastyears$Country <- rownames(mortal_lastyears)
#creating a new dataframe to merge its cluster to the data.
country_cluster <- mortal_lastyears %>%
select(c("Country","clust2","clust3"))
mortal_allyears <- merge(country_cluster,mortal_allyears,by = "Country")#replacing NAs with avg value
mortal_allyears %>%
is.na() %>%
colSums()## Country clust2 clust3
## 0 0 0
## year inf.mort life.exp.both
## 0 17 16
## maternal.mortality.ratio pop.increase tot.fertil.rate
## 74 5 5
## Continent
## 0
life.exp.both.avg.2 <- mean(mortal_allyears$life.exp.both,na.rm = T)
life.exp.male.avg.2 <- mean(mortal_allyears$life.exp.male, na.rm = T)## Warning in mean.default(mortal_allyears$life.exp.male, na.rm = T): argument
## is not numeric or logical: returning NA
life.exp.female.avg.2 <- mean(mortal_allyears$life.exp.female,na.rm = T)## Warning in mean.default(mortal_allyears$life.exp.female, na.rm = T):
## argument is not numeric or logical: returning NA
inf.mort.avg.2 <- mean(mortal_allyears$inf.mort,na.rm = T)
maternal.mortality.ratio.avg.2 <- mean(mortal_allyears$maternal.mortality.ratio,na.rm = T)
pop.increase.avg.2 <- mean(mortal_allyears$pop.increase,na.rm = T)
tot.fertil.rate.avg.2 <- mean(mortal_allyears$tot.fertil.rate,na.rm = T)
mortal_allyears <- mortal_allyears %>%
mutate(inf.mort = replace_na(inf.mort,inf.mort.avg.2),
life.exp.both = replace_na(life.exp.both,life.exp.both.avg.2),
maternal.mortality.ratio = replace_na(maternal.mortality.ratio,maternal.mortality.ratio.avg.2),
pop.increase = replace_na(pop.increase,pop.increase.avg.2),
tot.fertil.rate = replace_na(tot.fertil.rate,tot.fertil.rate.avg.2))
mortal_allyears %>%
is.na() %>%
colSums()## Country clust2 clust3
## 0 0 0
## year inf.mort life.exp.both
## 0 0 0
## maternal.mortality.ratio pop.increase tot.fertil.rate
## 0 0 0
## Continent
## 0
#eliminating some rows
cb2 <-
mortal_allyears$inf.mort == inf.mort.avg.2 &
mortal_allyears$life.exp.both == life.exp.both.avg.2 &
mortal_allyears$maternal.mortality.ratio == maternal.mortality.ratio.avg.2 &
mortal_allyears$tot.fertil.rate == tot.fertil.rate.avg.2 |
mortal_allyears$maternal.mortality.ratio == maternal.mortality.ratio.avg.2
mortal_allyears <- cbind(mortal_allyears, cb2)
mortal_allyears <- mortal_allyears %>%
filter(cb2 == F)
mortal_allyears <- mortal_allyears[,-11]
str(mortal_allyears)## 'data.frame': 541 obs. of 10 variables:
## $ Country : chr "Afghanistan" "Afghanistan" "Afghanistan" "Albania" ...
## $ clust2 : Factor w/ 2 levels "1","2": 2 2 2 1 1 1 1 1 1 2 ...
## $ clust3 : Factor w/ 3 levels "1","2","3": 1 1 1 2 2 2 3 3 3 1 ...
## $ year : Factor w/ 18 levels "2000","2001",..: 6 11 16 6 11 16 6 11 16 6 ...
## $ inf.mort : num 89.5 76.7 68.6 21.1 16.8 ...
## $ life.exp.both : num 56.9 60 62.3 74.8 75.6 77.7 71.5 73.9 75.3 50 ...
## $ maternal.mortality.ratio: num 821 584 396 30 30 29 148 147 140 705 ...
## $ pop.increase : num 4.4 2.8 3.2 -0.3 -0.9 -0.1 1.3 1.6 2 3.5 ...
## $ tot.fertil.rate : num 7.2 6.4 5.3 1.9 1.6 1.7 2.4 2.7 3 6.6 ...
## $ Continent : Factor w/ 5 levels "Africa","Americas",..: 3 3 3 4 4 4 1 1 1 1 ...
Plotting
mortal_allyears_scaled <- scale(mortal_allyears[,-c(1,2,3,4,10)])
summary(mortal_allyears_scaled)## inf.mort life.exp.both maternal.mortality.ratio
## Min. :-1.0953 Min. :-2.7946 Min. :-0.7531
## 1st Qu.:-0.8480 1st Qu.:-0.6549 1st Qu.:-0.7024
## Median :-0.3681 Median : 0.2701 Median :-0.5165
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.7282 3rd Qu.: 0.7158 3rd Qu.: 0.4170
## Max. : 2.8433 Max. : 1.5739 Max. : 3.2301
## pop.increase tot.fertil.rate
## Min. :-2.50753 Min. :-1.2305
## 1st Qu.:-0.66838 1st Qu.:-0.7849
## Median :-0.07722 Median :-0.3393
## Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.57962 3rd Qu.: 0.6791
## Max. : 8.46171 Max. : 2.9707
summary(mortal_lastyears_scaled)## inf.mort life.exp.both maternal.mortality.ratio
## Min. :-1.0369 Min. :-2.6949 Min. :-0.7804
## 1st Qu.:-0.8089 1st Qu.:-0.6741 1st Qu.:-0.7142
## Median :-0.3700 Median : 0.2131 Median :-0.4225
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.7055 3rd Qu.: 0.6937 3rd Qu.: 0.1243
## Max. : 2.9556 Max. : 1.4947 Max. : 3.5298
## pop.increase tot.fertil.rate
## Min. :-2.7139 Min. :-1.1726
## 1st Qu.:-0.7343 1st Qu.:-0.7470
## Median :-0.0744 Median :-0.3215
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6588 3rd Qu.: 0.6359
## Max. : 3.8116 Max. : 3.2245
pr.2 <- prcomp(mortal_allyears_scaled,scale. = F)
pr.3 <- PCA(mortal_allyears_scaled, graph = F, scale.unit = F)mortal_allyears_clust <- data.frame(pr.2$x,
mortal_allyears)
mortal_allyears_clust <- mortal_allyears_clust %>%
select(c(1,2,6,7,8,9,15)) %>%
filter(year %in% c(2005,2010,2015))
mortal_allyears_clust$year <- as.character(mortal_allyears_clust$year)
mortal_allyears_clust$year <- as.numeric(mortal_allyears_clust$year)
str(mortal_allyears_clust)## 'data.frame': 541 obs. of 7 variables:
## $ PC1 : num -4.8 -3.38 -2.48 1.48 1.8 ...
## $ PC2 : num -0.674 -0.027 -0.489 0.799 1.122 ...
## $ Country : chr "Afghanistan" "Afghanistan" "Afghanistan" "Albania" ...
## $ clust2 : Factor w/ 2 levels "1","2": 2 2 2 1 1 1 1 1 1 2 ...
## $ clust3 : Factor w/ 3 levels "1","2","3": 1 1 1 2 2 2 3 3 3 1 ...
## $ year : num 2005 2010 2015 2005 2010 ...
## $ Continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 4 4 4 1 1 1 1 ...
- 2 Clusters
ggplot(mortal_allyears_clust, aes(-PC1,-PC2)) +
geom_hline(aes(yintercept=0), size=.2, alpha = 0.8, linetype = 2) +
geom_vline(aes(xintercept=0), size=.2, alpha = 0.8, linetype = 2)+
geom_point(aes(col = clust2, pch = Continent))+
theme_classic() +
transition_time(year,range = c(2005,2015)) Some countries are moving from cluster 1 to cluster 2.
ggplot(mortal_allyears_clust, aes(PC1,PC2)) +
geom_hline(aes(yintercept=0), size=.2, alpha = 0.8, linetype = 2) +
geom_vline(aes(xintercept=0), size=.2, alpha = 0.8, linetype = 2) +
geom_point(aes(col = clust3, pch = Continent))+
theme_classic() +
transition_time(year,range = c(2005,2015)) There are some countries change their cluster.
The cluster position are flipped, its because the “var” plot is different
fviz_pca_var(pr.2) #the animatedfviz_pca_var(mortal_pca_pr) #the 2015 data They’re flipped 180 degrees for each arrow, so the information gained from animated plot is still valid anyway.
Recommendation
Based on the previous analyst, I recommend to use 3 cluster because it gives us some more information. The 2 cluster is too general while the 3 cluster is more specific.
The use of 2 cluster only give us information that there are 2 groups of country, the first which have high life expectancy, low fertility rate, and low pop. increase. and the other one is the opposite.
But when we use 3 cluster we can see the middle cluster between the extremes.